Skip to content

Fixes fragment loading to properly pack 16b values into a 32b register#2

Merged
diptorupd merged 4 commits intoamd-integrationfrom
fix/load_fragment
Oct 1, 2025
Merged

Fixes fragment loading to properly pack 16b values into a 32b register#2
diptorupd merged 4 commits intoamd-integrationfrom
fix/load_fragment

Conversation

@diptorupd
Copy link
Copy Markdown
Collaborator

The PR fixes a bug in the HIP MMA load_fragment function. The existing implementation was incorrectly packing two 16b half values into a 32b register in a Big endian format. Since CDNA3 is a little endian ISA, it was causing incorrect values to be populated in the registers.

__half a[4];
uint32_t r1, r2;
// Existing logic: r1 = a[0],a[1]
// Fixed logic:    r1 = a[1],a[0]

The PR:

  • fixes the load_fragment function and simplifies the logic to not require bitwise shifts.
  • fixes the transpose_4x4_half_registers function that was implemented to work based on the incorrect logic.
  • Removed the load_fragment_transpose function as it is not required.
  • Updates the test_mfma_fp32_16x16x16fp16.cpp to use the transpose_4x4_half_registers function.

PS: The test_mfma_fp32_16x16x16fp16.cpp also implements the same compute as needed in compute_sfm_v and correctly implements the traversal logic for the B-matrix.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a critical endianness bug in the HIP MMA fragment loading functions. The implementation was incorrectly packing 16-bit half values into 32-bit registers in big-endian format instead of the little-endian format required by CDNA3 ISA.

  • Fixed load_fragment function to use direct memory access instead of bit shifting
  • Corrected transpose_4x4_half_registers function to work with the fixed endianness
  • Removed obsolete load_fragment_transpose function and replaced its usage with proper transposition

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
libflashinfer/include/gpu_iface/backend/hip/mma_hip.h Core fix for endianness bug in fragment loading and transpose operations
libflashinfer/include/gpu_iface/mma_ops.hpp Removed obsolete transpose function and fixed template parameter
libflashinfer/tests/hip/test_mfma_fp32_16x16x16fp16.cpp Updated test to use corrected functions and improved indexing logic

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines 37 to 41
template <typename T>
__device__ __forceinline__ void load_fragment_transpose_4x4_half_registers(uint32_t* R,
const T* smem_ptr) {
static_assert(std::is_same<T, int>::value,
static_assert(std::is_same<T, __half>::value,
"Only __half is supported for the 4x4 register transpose");
Copy link

Copilot AI Sep 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The static_assert checks for type T but the function call always uses __half. This creates a mismatch - either the template parameter T should be used in the function call, or the static_assert should be removed since T is not used.

Copilot uses AI. Check for mistakes.
__device__ __forceinline__ void load_fragment_4x4_half_registers(uint32_t* R, const T* smem_ptr) {
static_assert(std::is_same_v<T, __half>(), "Only half type is supported");
// Each thread loads 4 __half values in two 32b registers.
static_assert(std::is_same_v<T, __half>, "Only half type is supported");
Copy link

Copilot AI Sep 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The static_assert uses std::is_same_v which requires C++17, but the codebase appears to use C++14/C++11 style elsewhere (e.g., std::is_same<T, __half>::value in the other file). This should be consistent with the project's C++ standard.

Suggested change
static_assert(std::is_same_v<T, __half>, "Only half type is supported");
static_assert(std::is_same<T, __half>::value, "Only half type is supported");

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

@rtmadduri rtmadduri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these changes look good to me.

@diptorupd diptorupd merged commit bec5bdf into amd-integration Oct 1, 2025
1 check failed
@diptorupd diptorupd deleted the fix/load_fragment branch October 2, 2025 16:45
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
This PR introduces a patch to includes

Tested with Unit Tests:

Test project /root/amd_eng/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/4 Test #1: MathTest .........................   Passed    3.25 sec
    Start 2: PosEncTest
2/4 Test #2: PosEncTest .......................   Passed    3.25 sec
    Start 3: CascadeTest
3/4 Test #3: CascadeTest ......................   Passed    3.24 sec
    Start 4: PageTest
4/4 Test #4: PageTest .........................   Passed  161.15 sec

100% tests passed, 0 tests failed out of 4
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
This PR fixes some of the unit test failures that occur in Single
Decode. It also disables clang formatting of headers.
The clang format of headers causes compilation issues. The compiler is
unable to find `HIP WARP SYNC INTRINSICS` causing failures. Disabling
clang format fixes these issues

```
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.31 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.36 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed    3.35 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  114.08 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.22 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  559.75 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 719.07 sec
```
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
CPP test suite was using `hipified` headers. In this PR, we port over unit tests to use `gpu_iface`. This is necessary for us as the next step is to move the build infrastructure to use `gpu_iface`

This PR has been tested locally 
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.40 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.40 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed  985.27 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  112.40 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.46 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  556.81 sec

100% tests passed, 0 tests failed out of 6
```

To replicate the tests
```
cd flashinfer/libflashinfer/tests/hip
```
```
mkdir build && cd build/
```
```
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ ..
```
```
make
```
```
ctest
```
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
In this PR I remove the `libtorch` dependency and removed
`test_page.cpp`. `test_page.cpp` is the only unit test that uses
libtorch. However, we also have a pytest for testing page. We will use
that for validation.

Removing the libtorch dependency will help us speed docker builds and
remove additional dependencies.


```Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
diptorupd added a commit that referenced this pull request Dec 5, 2025
#2)

* Fix load_fragment

* Fix Array OOO access in debug function

* Fixes based on Copilot review.

* More Copilot review based fixes
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
This PR introduces a patch to includes

Tested with Unit Tests:

Test project /root/amd_eng/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/4 Test ROCm#1: MathTest .........................   Passed    3.25 sec
    Start 2: PosEncTest
2/4 Test ROCm#2: PosEncTest .......................   Passed    3.25 sec
    Start 3: CascadeTest
3/4 Test ROCm#3: CascadeTest ......................   Passed    3.24 sec
    Start 4: PageTest
4/4 Test ROCm#4: PageTest .........................   Passed  161.15 sec

100% tests passed, 0 tests failed out of 4
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
This PR fixes some of the unit test failures that occur in Single
Decode. It also disables clang formatting of headers.
The clang format of headers causes compilation issues. The compiler is
unable to find `HIP WARP SYNC INTRINSICS` causing failures. Disabling
clang format fixes these issues

```
    Start 1: MathTest
1/6 Test ROCm#1: MathTest .........................   Passed    3.31 sec
    Start 2: PosEncTest
2/6 Test ROCm#2: PosEncTest .......................   Passed    3.36 sec
    Start 3: CascadeTest
3/6 Test ROCm#3: CascadeTest ......................   Passed    3.35 sec
    Start 4: PageTest
4/6 Test ROCm#4: PageTest .........................   Passed  114.08 sec
    Start 5: SingleDecodeTest
5/6 Test ROCm#5: SingleDecodeTest .................   Passed   35.22 sec
    Start 6: BatchDecodeTest
6/6 Test ROCm#6: BatchDecodeTest ..................   Passed  559.75 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 719.07 sec
```
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
CPP test suite was using `hipified` headers. In this PR, we port over unit tests to use `gpu_iface`. This is necessary for us as the next step is to move the build infrastructure to use `gpu_iface`

This PR has been tested locally 
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/6 Test ROCm#1: MathTest .........................   Passed    3.40 sec
    Start 2: PosEncTest
2/6 Test ROCm#2: PosEncTest .......................   Passed    3.40 sec
    Start 3: CascadeTest
3/6 Test ROCm#3: CascadeTest ......................   Passed  985.27 sec
    Start 4: PageTest
4/6 Test ROCm#4: PageTest .........................   Passed  112.40 sec
    Start 5: SingleDecodeTest
5/6 Test ROCm#5: SingleDecodeTest .................   Passed   35.46 sec
    Start 6: BatchDecodeTest
6/6 Test ROCm#6: BatchDecodeTest ..................   Passed  556.81 sec

100% tests passed, 0 tests failed out of 6
```

To replicate the tests
```
cd flashinfer/libflashinfer/tests/hip
```
```
mkdir build && cd build/
```
```
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ ..
```
```
make
```
```
ctest
```
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
In this PR I remove the `libtorch` dependency and removed
`test_page.cpp`. `test_page.cpp` is the only unit test that uses
libtorch. However, we also have a pytest for testing page. We will use
that for validation.

Removing the libtorch dependency will help us speed docker builds and
remove additional dependencies.


```Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test ROCm#1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test ROCm#2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test ROCm#3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test ROCm#4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test ROCm#5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test ROCm#6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test ROCm#7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test ROCm#8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
ROCm#2)

* Fix load_fragment

* Fix Array OOO access in debug function

* Fixes based on Copilot review.

* More Copilot review based fixes
diptorupd pushed a commit that referenced this pull request Feb 2, 2026
This PR introduces a patch to includes

Tested with Unit Tests:

Test project /root/amd_eng/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/4 Test #1: MathTest .........................   Passed    3.25 sec
    Start 2: PosEncTest
2/4 Test #2: PosEncTest .......................   Passed    3.25 sec
    Start 3: CascadeTest
3/4 Test #3: CascadeTest ......................   Passed    3.24 sec
    Start 4: PageTest
4/4 Test #4: PageTest .........................   Passed  161.15 sec

100% tests passed, 0 tests failed out of 4
diptorupd pushed a commit that referenced this pull request Feb 2, 2026
This PR fixes some of the unit test failures that occur in Single
Decode. It also disables clang formatting of headers.
The clang format of headers causes compilation issues. The compiler is
unable to find `HIP WARP SYNC INTRINSICS` causing failures. Disabling
clang format fixes these issues

```
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.31 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.36 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed    3.35 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  114.08 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.22 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  559.75 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 719.07 sec
```
diptorupd pushed a commit that referenced this pull request Feb 2, 2026
CPP test suite was using `hipified` headers. In this PR, we port over unit tests to use `gpu_iface`. This is necessary for us as the next step is to move the build infrastructure to use `gpu_iface`

This PR has been tested locally 
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.40 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.40 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed  985.27 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  112.40 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.46 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  556.81 sec

100% tests passed, 0 tests failed out of 6
```

To replicate the tests
```
cd flashinfer/libflashinfer/tests/hip
```
```
mkdir build && cd build/
```
```
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ ..
```
```
make
```
```
ctest
```
diptorupd pushed a commit that referenced this pull request Feb 2, 2026
In this PR I remove the `libtorch` dependency and removed
`test_page.cpp`. `test_page.cpp` is the only unit test that uses
libtorch. However, we also have a pytest for testing page. We will use
that for validation.

Removing the libtorch dependency will help us speed docker builds and
remove additional dependencies.


```Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
diptorupd added a commit that referenced this pull request Feb 2, 2026
#2)

* Fix load_fragment

* Fix Array OOO access in debug function

* Fixes based on Copilot review.

* More Copilot review based fixes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants